Overview of IPython.parallel

Introduction

IPython is a tool for interactive and exploratory computing. We have seen that IPython's kernel provides a mechanism for interactive remote computation, and we have extended this same mechanism for interactive remote parallel computation, simply by having multiple kernels.

Architecture overview

At a high level, there are three basic components to parallel IPython:

  • Engine(s) - the remote or distributed processes where your code runs.
  • Client - your interface to running code on Engines.
  • Controller - the collection of processes that coordinate Engines and Clients.

These components live in the IPython.parallel package and are installed with IPython.

This leads to a usage model where you can create as many engines as you want per compute node, and then control them all from your clients via a central 'controller' object that encapsulates the hub and schedulers.

IPython engine

The Engine is simply a remote Python namespace where your code is run, and is identical to the kernel used elsewhere in IPython.

It can do all the magics, pylab integration, and everything else you can do in a regular IPython session.

IPython controller

The Controller is a collection of processes:

  • Schedulers relay messages between Engines and Clients.
  • The Hub monitors the cluster state.

Together, these processes provide a single connection point for your clients and engines. Each Scheduler is a small GIL-less function in C provided by pyzmq (the Python load-balancing scheduler being an exception).

The Hub can be viewed as an über-logger, which monitors all communication between clients and engines, and can log to a database (e.g. SQLite or MongoDB) for later retrieval or resubmission. The Hub is not involved in execution in any way, and a slow Hub cannot slow down submission of tasks.
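As a sketch of what that looks like, the database backend is selected in the controller's configuration file; the option name below is taken from the IPython.parallel docs, but check it against your IPython version:

```python
# ipcontroller_config.py -- sketch; verify option names against your
# IPython version's documentation.
c = get_config()

# Default is an in-memory DictDB; SQLiteDB persists task results to disk.
c.HubFactory.db_class = "SQLiteDB"

# Alternatively, use a MongoDB backend (requires pymongo):
# c.HubFactory.db_class = "MongoDB"
```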

Schedulers

All actions performed on an engine go through a Scheduler. While the engines themselves block while user code runs, the schedulers hide that from the user, providing a fully asynchronous interface to a set of engines.
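The same pattern, blocking workers behind a non-blocking submission interface, can be sketched with the standard library's concurrent.futures (an analogy only, not the IPython.parallel API):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def blocking_job(t):
    # stands in for an engine: blocks for the duration of the job
    time.sleep(t)
    return t

with ThreadPoolExecutor(max_workers=4) as pool:
    # submit() returns a handle immediately, like the schedulers do
    futures = [pool.submit(blocking_job, 0.1) for _ in range(4)]
    # we keep control here while all four jobs run concurrently
    results = [f.result() for f in futures]

print(results)  # [0.1, 0.1, 0.1, 0.1]
```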

IPython client and views

There is one primary object, the Client, for connecting to a cluster. For each execution model there is a corresponding View, and you determine how your work should be executed on the cluster by creating different views or manipulating attributes of views.

The two basic views are:

  • The DirectView class for explicitly running code on particular engine(s).
  • The LoadBalancedView class for destination-agnostic scheduling.

You can use as many views of each kind as you like, all at the same time.

Getting Started

Starting the IPython controller and engines

To follow along with this tutorial, you will need to start the IPython controller and some IPython engines. The quickest way is to visit the 'clusters' tab in the notebook dashboard and start some engines with the 'default' profile, or to use the ipcluster command:

$ ipcluster start -n 4

There isn't time to go into it here, but ipcluster can be used to start engines and the controller with various batch systems, including:

  • SGE
  • PBS
  • LSF
  • MPI
  • SSH
  • WinHPC

More information on starting and configuring the IPython cluster can be found in the IPython.parallel docs.

Once you have started the IPython controller and one or more engines, you are ready to use the engines to do something useful.

To make sure everything is working correctly, let's do a very simple demo:


In [1]:
from IPython import parallel
rc = parallel.Client()
rc.block = True

In [2]:
rc.ids


Out[2]:
[0, 1, 2, 3]

Let's define a simple function


In [3]:
def mul(a,b):
    return a*b

In [4]:
mul(5,6)


Out[4]:
30

Create some Views


In [5]:
dview = rc[:]
dview


Out[5]:
<DirectView [0, 1, 2, 3]>

In [6]:
e0 = rc[0]
e0


Out[6]:
<DirectView 0>

What does it look like to call this function remotely?

Just turn f(*args, **kwargs) into view.apply(f, *args, **kwargs)!


In [7]:
e0.apply(mul, 5, 6)


Out[7]:
30

And the same thing in parallel?


In [8]:
dview.apply(mul, 5, 6)


Out[8]:
[30, 30, 30, 30]

Python has a built-in map for calling a function on a sequence of arguments


In [9]:
map(mul, range(1,10), range(2,11))


Out[9]:
[2, 6, 12, 20, 30, 42, 56, 72, 90]
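This session appears to have been run on Python 2, where map returns a list. On Python 3, map returns a lazy iterator, so you would wrap it in list() to get the same output:

```python
def mul(a, b):
    return a * b

# Python 3: map is lazy, so materialize the result with list()
print(list(map(mul, range(1, 10), range(2, 11))))
# [2, 6, 12, 20, 30, 42, 56, 72, 90]
```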

So how do we do this in parallel?


In [10]:
dview.map(mul, range(1,10), range(2,11))


Out[10]:
[2, 6, 12, 20, 30, 42, 56, 72, 90]

We can also run code given as strings with execute


In [11]:
dview.execute("import os")
dview.execute("a = os.getpid()")


Out[11]:
<AsyncResult: finished>

Treating the view as a dict lets you access the remote namespace:


In [12]:
dview['a']


Out[12]:
[24377, 24380, 24381, 24384]

AsyncResults

When you do async execution, the calls return an AsyncResult object immediately


In [13]:
def wait_here(t):
    import time, os
    time.sleep(t)
    return os.getpid()

In [14]:
ar = dview.apply_async(wait_here, 2)

In [15]:
ar.get()


Out[15]:
[24377, 24380, 24381, 24384]
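The AsyncResult API (get, ready, wait) is patterned after multiprocessing.pool.AsyncResult, and the same submit-then-collect pattern exists in the standard library's concurrent.futures. As a standalone analogy (not the IPython.parallel API itself):

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor

def wait_here(t):
    time.sleep(t)
    return os.getpid()

with ThreadPoolExecutor(max_workers=4) as pool:
    fut = pool.submit(wait_here, 0.2)  # returns a Future immediately
    print(fut.done())    # almost certainly False: the job is still sleeping
    res = fut.result()   # blocks until the result is ready, like ar.get()

print(res)  # the current process id (threads share the pid)
```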
